Context Distillation
Training Plug-n-Play Knowledge Modules with Deep Context Distillation
Caccia, Lucas, Ansell, Alan, Ponti, Edoardo, Vulić, Ivan, Sordoni, Alessandro
Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.
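The distillation objective described above can be sketched in plain Python. The function names, the softmax/KL formulation, and the way the hidden-state term is weighted are illustrative assumptions for a single position and a tiny vocabulary, not the paper's implementation:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q) between two token distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dcd_loss(teacher_logits, student_logits,
             teacher_hidden, student_hidden, alpha=1.0):
    """Deep context distillation loss (sketch): match the teacher's
    output distribution (KL term) plus its hidden states (L2 term).
    The teacher sees the document in context; the student (base model
    plus KM/LoRA) does not. `alpha` is an assumed weighting knob."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    hid = sum((t - s) ** 2
              for t, s in zip(teacher_hidden, student_hidden)) / len(teacher_hidden)
    return kl + alpha * hid
```

In a real training loop, both terms would be computed per token and per layer with the KM parameters as the only trainable weights; this sketch only shows the shape of the objective.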
In-Context Learning Distillation for Efficient Few-Shot Fine-Tuning
Duan, Yifei, Li, Liu, Zhai, Zirui, Yao, Jinxia
Conventional solutions to few-shot learning generally fall into two categories: weights-updating fine-tuning and prompt-based context learning. Each approach has significant limitations, particularly when scaling to larger models or deploying in resource-constrained environments. Fine-tuning requires updating some or all model parameters, leading to high computational costs and potential catastrophic forgetting. The authors train a model for the natural language inference task and employ knowledge distillation to internalize the context information, reducing model parameters from 1.3B to 125M and achieving a size reduction from 2.5GB to 0.25GB. Compared to using in-context learning alone on similarly sized models, this context distillation approach achieves a nearly 50% improvement in out-of-domain accuracy.
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
Hsu, Sheryl, Khattab, Omar, Finn, Chelsea, Sharma, Archit
The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and to ground their answers in real sources. Observing that LLMs can learn to search for relevant facts by trying different queries and learning to up-weight queries that successfully produce relevant results, we introduce Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allow it to be applied to arbitrary off-the-shelf retrievers and make it a promising technique for improving general LLM pipelines.

Despite tremendous progress, large language models (LLMs) still often hallucinate, motivating significant interest in grounding LLM answers in verified sources (Guu et al., 2020; Komeili et al., 2022; PerplexityAI, 2024; Google, 2024; OpenAI, 2024). Unfortunately, simply retrieving documents semantically similar to the user question, as is prevalent in retrieval-augmented generation (RAG; Lewis et al. 2020) pipelines, tends to fail for complex information needs not answered directly by any individual document. To tackle this, multi-hop retrieval pipelines gather information incrementally over multiple steps of search. For example, if a user asks What is a good dinner place driving from the Bay Area to Lake Tahoe on Friday night to avoid traffic?, a grounded system might need to learn about towns en route to Lake Tahoe from the Bay Area, followed by the traffic forecast on I-80 and, finally, restaurants in Auburn (and other towns). However, doing this successfully is hard, as off-the-shelf LLM performance is often unsatisfactory, and producing supervision for the best search queries to generate in a sequence of "hops" is nontrivial and expensive.
Recent work tackles this via prompt optimization and rejection fine-tuning given a downstream signal.
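The core loop LeReT describes, sampling candidate queries, scoring them by retrieval success, and up-weighting the better ones via preference pairs, can be sketched as follows. The `retrieval_reward` metric and the pair-construction rule are simplifying assumptions for illustration, not the paper's exact recipe:

```python
def retrieval_reward(retrieved_ids, gold_ids):
    # Fraction of gold documents recovered by this query's results.
    hits = len(set(retrieved_ids) & set(gold_ids))
    return hits / len(gold_ids)

def build_preference_pairs(queries, retrieve, gold_ids):
    """Score each sampled query by its retrieval reward and emit
    (preferred, rejected) pairs for preference-based optimization
    (e.g., a DPO-style objective over the query generator)."""
    scored = [(q, retrieval_reward(retrieve(q), gold_ids)) for q in queries]
    pairs = []
    for qa, ra in scored:
        for qb, rb in scored:
            if ra > rb:
                pairs.append((qa, qb))
    return pairs
```

In a multi-hop setting, the same scoring would be applied per hop, with each hop's query conditioned on the results of the previous one.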
Efficient LLM Context Distillation
Upadhayayaya, Rajesh, Smith, Zachary, Kottmyer, Christopher, Osti, Manish Raj
Large Language Models (LLMs) demonstrate proficiency across diverse tasks but often require targeted adaptations for specific applications. Various methods have been proposed to facilitate this adaptation, including few-shot fine-tuning, in-context learning, and context distillation. Whereas in-context learning is bounded by the model's constrained context window, context distillation (CD) extends accessible task-specific examples by internalizing them, greatly increasing the number of available examples outside of the query prompt [1].
Comparative Analysis of Different Efficient Fine Tuning Methods of Large Language Models (LLMs) in Low-Resource Setting
Srinivasan, Krishna Prasad Varadarajan, Gumpena, Prasanth, Yattapu, Madhusudhana, Brahmbhatt, Vishal H.
In the domain of large language models (LLMs), arXiv:2305.16938 showed that few-shot full-model fine-tuning -- namely Vanilla Fine Tuning (FT) and Pattern-Based Fine Tuning (PBFT) -- and In-Context Learning (ICL) generalize similarly on Out-Of-Domain (OOD) datasets, but vary in terms of task adaptation. However, both pose challenges, especially in terms of memory requirements. In this paper, we further the understanding of different fine-tuning strategies for LLMs and aim to place a range of these methods on the same footing for an elaborate comparison with full-model fine-tuning on two diverse datasets. To that end, we conducted a series of experiments, beginning with state-of-the-art methods like vanilla fine-tuning and Pattern-Based Fine-Tuning (PBFT) on pre-trained models across two datasets, COLA and MNLI. We then investigate adaptive fine-tuning and the efficiency of LoRA adapters in a few-shot setting. Finally, we also compare an alternative approach that has gained recent popularity -- context distillation -- with vanilla FT and PBFT, with and without a few-shot setup. Our findings suggest that the alternative strategies we explored can exhibit out-of-domain generalization comparable to that of vanilla FT and PBFT. PBFT underperforms vanilla FT on out-of-domain (OOD) data, emphasizing the need for effective prompts. Further, our adaptive fine-tuning and LoRA experiments perform comparably to or slightly worse than the standard fine-tunings, as anticipated, since standard fine-tuning updates the entire model. Finally, our context distillation experiments outperform the standard fine-tuning methods. These findings underscore that the choice of an appropriate fine-tuning method ultimately depends on the available resources (memory, compute, data) and task adaptability.
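Since several of the compared methods rely on LoRA adapters, a minimal sketch of the low-rank update may clarify why they are memory-efficient: only the small matrices A and B are trained, while W stays frozen. The matrix shapes and the `scale` parameter here are illustrative assumptions:

```python
def matmul(A, B):
    # Plain-Python matrix multiply for the sketch.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, scale=1.0):
    """LoRA forward pass (sketch): y = xW + scale * x(AB), where
    A (d x r) and B (r x d) with small rank r form the trainable
    low-rank update while the base weight W is frozen."""
    base = matmul(x, W)
    low_rank = matmul(matmul(x, A), B)
    return [[b + scale * l for b, l in zip(rb, rl)]
            for rb, rl in zip(base, low_rank)]
```

With rank r much smaller than the model dimension d, the trainable parameter count drops from d*d to 2*d*r per adapted weight matrix, which is what makes few-shot LoRA fine-tuning cheap relative to full-model FT.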
Learning by Distilling Context
Snell, Charlie, Klein, Dan, Zhong, Ruiqi
Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with informative instructions, and they acquire new reasoning capabilities by generating a scratchpad before predicting the final answers. However, they do not internalize these performance gains, which disappear when the context tokens are gone. Our work proposes to apply context distillation so that a language model can improve itself by internalizing these gains. Concretely, given a synthetic unlabeled input for the target task, we condition the model on "[instructions] + [task-input]" to predict "[scratch-pad] + [final answer]"; then we fine-tune the same model to predict its own "[final answer]" conditioned on the "[task-input]", without seeing the "[instructions]" or using the "[scratch-pad]". We show that context distillation is a general method to train language models, and it can effectively internalize 3 types of training signals. First, it can internalize abstract task instructions and explanations, so we can iteratively update the model parameters with new instructions and overwrite old ones. Second, it can internalize step-by-step reasoning for complex tasks (e.g., 8-digit addition), and such a newly acquired capability proves to be useful for other downstream tasks. Finally, it can internalize concrete training examples, and it outperforms directly learning with gradient descent by 9% on the SPIDER Text-to-SQL dataset; furthermore, combining context distillation operations can internalize more training examples than the context window size allows.
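The data-construction step this abstract describes can be sketched directly: the teacher prompt carries the instructions and the generation carries the scratchpad, but the student example keeps only the task input and the final answer. The assumption that the answer sits on the last line of the teacher's output is an illustrative convention, not the paper's format:

```python
def build_distillation_example(instructions, task_input, teacher_generate):
    """Context distillation data construction (sketch): the teacher is
    conditioned on "[instructions] + [task-input]" and produces
    "[scratch-pad] + [final answer]"; the student target drops the
    instructions and scratch-pad, keeping only the final answer."""
    teacher_output = teacher_generate(instructions + "\n" + task_input)
    # Assumed convention: the final answer is the last line of the output.
    final_answer = teacher_output.strip().splitlines()[-1]
    return {"input": task_input, "target": final_answer}
```

Fine-tuning the same model on many such (input, target) pairs is what internalizes the instructions and scratchpad reasoning into the weights.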